Version 1.0 | First Created Nov 4, 2023 | Updated Nov 15, 2023
Keywords: Transportation Planning, Rideshare Forecast, Spatial and Serial Autocorrelation, Nested Tibble, OLS Regression, time lag, gganimate, purrr
Bikeshare programs have become an integral part of urban transportation system, providing a sustainable and convenient alternative to traditional commuting methods. Citi Bike is one of the earliest bicycle sharing system created by Lyft in 2008 at NYC. But in recent years, this program also extends to Jersey City across the Hudson River in order to cater to the dynamic mobility needs of residents who had to commute to Manhattan for work. As the popularity of Citi Bike continues to soar, there arises a pressing need for effective management strategies to ensure the availability of bikes at each station in Jersey City.
Bikesharing (and ridesharing) inherently involves the challenge of maintaining a delicate equilibrium in bike distribution, ie. “re-balancing”. This issue arises from the spatial and temporal variations in user demand. If certain stations experience a surge in demand during morning rush hours while others witness a decline, this could result in station clusters with depleted or surplus bike stocks, thereby compromising user experience. Companies like Uber & Lyft generate and analyze tremendous amounts of data to incentivize bike (ride) share use, solve routing problems, calculate pricing, and obviously forecast demands, but they are not very transparent about the algorithm they use. Therefore, the goal of this report is keep a simple, but open source version of bikeshare demand prediction, by harnessing of power of machine learning, particularly the Ordinary Least Square (OLS) regression to rapidly predict the demand for bikes across various stations in Jersey City. We will perform analysis on the historical data on bike usage at different stations and different time as well as under different weather conditions. The forecast allows the Citi Bike program to proactively address imbalances, strategically re-positioning bikes in anticipation of peak usage time or special events.
The subsequent sections on this report is organized as follows. Firstly, we load in our datasets for a quick examinations and perform some necessary feature engineering. Secondly, we conduct exploratory analysis on the temporal and spatial aspects of bikeshare data. We will also look at some weather data in this process. Thirdly, we create a space-time panel where each row is a unique instance of time-space combination of rides and visualize the correlation between space, time, weather, and number of rides. And finally, we make created four different models, test their sensibility to errors, and conduct cross validation.
The GitHub repository for this report is here.
The data we use here is obtained from NYC OpenData that contains the latest record of Citi Bike usage on a monthly base. Note that while Jersey City is located in New Jersey, Citi Bike is mainly operated in NYC and therefore its dataset is also operated by NYC. We selected all bikeshare data in Jersey City from Aug 1 2023 to Aug 31 2023 as our temporal frame. The dataset mainly includes locations and names of stations of which the user started the ride as well as end the ride. In addition, we also downloaded the census tract geometries of the entire Hudson County via tidycensus, which includes Jersey City.
nycbike <- read.csv(url("https://raw.githubusercontent.com/emilyzhou112/MUSA5080-NJ-BikeShare/main/JC-202308-citibike-tripdata.csv"))
njCensus <- get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001"),
year = 2021,
state = "NJ",
geometry = TRUE,
county=c("Hudson"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
jcTracts <- njCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf()
Since our raw bikeshare data is a non-spatial layer, despite containing spatial attributes, our next step is to make it spatial by joining it to the census geometry. We are only interested in the start location of each ride because we want to predict the capacity of each station when the user need a bike to start the ride. This requires us to filter out rows with missing latitude and longitude. We also found that some stations have erroneous longitude and latitude coding that lead to these points being located in the Hudson River. These points also need to be removed.
bike_census <- st_join(nycbike %>%
filter(is.na(start_lat) == FALSE &
is.na(start_lng) == FALSE &
is.na(end_lat) == FALSE &
is.na(end_lng) == FALSE) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326),
jcTracts %>% st_transform(crs=4326),
join=st_intersects, left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start_lng = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2)))%>%
filter(!(ride_id %in% c("BEC9CC4D1007C753", "86087F431785EDE7", "A1AAB92631439883", "71938DA085242DCC", "4203C77668C47C75", "C92410CA2AEF3947", "6E02CBF36C90F244", "0259885253FD238E", "B39136B37AFB5FE9", "1EC9376D1559C51C", "CDDF16D3A37BE370"))) %>%
as.data.frame()
The following visualization shows all the steps we did above.
ggplot()+
geom_sf(data = jcTracts %>%
st_transform(crs=4326), fill="white")+
geom_point(data = bike_census, aes(x=start_lng, y = start_lat),
color = "#6E0E0A", alpha = 1, size = 0.3) +
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
labs(title = "Location of Rides Starting Points in Jersey City") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
We will perform some basic feature engineering using R’s powerful
lubridate package to extract some temporal information for
each ride. This includes calculating the 60min and 15min interval for
the starting time of each ride, and categorizing these intervals into by
the time of day into “Overnight”, “AM Rush”, “Mid-Day”, and “PM Rush”
and by day of week into weekday and weekend.
bike_census <- bike_census %>%
mutate(interval60 = floor_date(ymd_hms(started_at), unit = "hour"),
interval15 = floor_date(ymd_hms(started_at), unit = "15 mins"),
week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
mutate(hour = hour(started_at),
weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"))
Interested in seeing the commuting patterns to get a sense of rider’s daily behavior, we created line strings that connect between the start and end locations of each ride. For the simplicity of visualization, we only visualize 200 rides. We may see that most of the rides occur within Jersey City, but there are also people who commute using Citi Bike from Jersey City to Manhattan.
lines <- st_sfc(
lapply(1:nrow(bike_census), function(i) {
st_linestring(matrix(c(bike_census[i, "start_lng"], bike_census[i, "start_lat"],
bike_census[i, "end_lng"], bike_census[i, "end_lat"]), ncol = 2, byrow = TRUE))
}))
lines_sf <- st_sf(geometry = lines)
lines_sf <- st_set_crs(lines_sf, 4326)
Route <- lines_sf %>% head(200)
mapview(Route, zcol = NULL, color = "#6E0E0A", linewidth = 0.001, alpha = 0.3) # only show 200
In this section, we delve into some further exploratory analysis of our data set.
Starting with temporal analysis, we first visualized the bikeshare trips per hour in Jersey City in August 2023. It is clear that there’s some form of circularity in that there are both peak hours and non-peak hours during a single day. Also, we may see that five consecutive higher peaks are usually followed by two less higher peaks. These correspond to high bikeshare need during rush hours and low bikeshare needs overnight, in early mornings, and over the weekend
ggplot(bike_census %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n), color= "#6E0E0A")+
labs(title="Bike Share Trips Per Hour in Jersey City, Aug 2023",
x="Date",
y="Number of trips") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Then, we visualized the distribution of trips by stations at each hour
in August. We may see that the distribution of rides is highly right
skewed, with most stations having zero or a few number of trips each
hour and a couple of stations having much more rides, up to 40 or even
50 rides at a particular hour on a particular day in August 2023.
ggplot(bike_census %>%
group_by(interval60, start_station_name) %>%
tally())+
geom_histogram(aes(n), binwidth = 5, fill="#6E0E0A")+
labs(title="Bike Share Trips Per Hour by Station in Jersey City, Aug 2023",
x="Trip Counts",
y="Number of Stations") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
We continue to visualize the distribution of mean number of trips at each station during different times of the day. We recognize that during PM rush, the average number of trips has been the highest throughout, with a couple of stations’ average trip exceed 10, implying a relatively high demand for a single station. This is followed by ride demand in mid day, when some stations can have more than 5 rides. It is also surprising to notice that the demand is pretty high overnight, with a few stations having more than 5 rides.
bike_census %>%
group_by(interval60, start_station_name, time_of_day) %>%
tally()%>%
group_by(start_station_name, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips, fill=time_of_day), binwidth = 1)+
scale_fill_manual(values = palette4, name="Time of Day") +
labs(title="Mean Number of Hourly Trips Per Station",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
After that, we visualized the total number of trips in August separated by day of the week. The two main patterns here are that firstly, trip numbers vary by time of the day. Lowest trip counts occur in early morning while highest trip count occur in late afternoon. Secondly, Tuesday, Wednesday, and Thursday tend to observe higher number of trips.
ggplot(bike_census %>% mutate(hour = hour(started_at)))+
geom_freqpoly(aes(hour, color = dotw), binwidth = 3, lwd = 0.8)+
scale_color_manual(values = c("#124E78", "#819FA1", "#F0F0C9", "#F2BB05", "#E58507", "#D74E09", "#6E0E0A"), name = "Day") +
labs(title="Bike Share Trips in Jersey City by Day of the Week, Aug 2023",
x="Hour",
y="Trip Counts") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Same type of visualize but this time separated between weekend and
weekdays. It is very obvious that we see more trips during weekdays and
less trips during weekends. Also, during weekends, number of trips start
to increase much later than during weekdays, suggesting the missing of
morning rush hour will significantly impact people’s use of bikeshare
services.
ggplot(bike_census)+
geom_freqpoly(aes(hour, color = weekend), binwidth = 3, lwd = 1.2)+
scale_color_manual(values=palette2) +
labs(color = "Time of Week") +
labs(title="Bike Share Trips in Jersey City - Weekend vs Weekday, Aug 2023",
x="Hour",
y="Trip Counts") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
Building onto that, we group this dataset by station and time of day to
visualize the total number of trips at each station during a specific
period of time on weekday or weekend. To simplify our visualization, we
selected the top 50 occurrences in our dataset. Note that here, each bar
represent an instance of station-time of day-weekday/weekend
combination. We are able to see that PM rush hour indeed sees that
greatest bikeshare demand for a couple of stations in particular.
Weekend rides only appear three times in the top 50 list. Zooming into
the detail reveals to us that Grove St PATH station
observes a total of 1117 rides during PM rush on a weekday, 801 rides
overnight on a weekday in August 2023. South Waterfront Walkway
- Sinatra Dr & 1 St station observes 319 rides overnight on
a weekend and 731 rides during PM rush on a weekday in August 2023.
to_plot <- bike_census%>% # only show 100 stations
group_by(start_station_id, start_lat, start_lng, weekend, time_of_day, start_station_name) %>%
tally() %>%
arrange(-n) %>%
head(50)
to_plot$ID <- seq_along(to_plot$start_station_id)
to_plot %>%
arrange(-n) %>%
head(50) %>%
ggplot(aes(x = reorder(ID, -n), n, fill = time_of_day, color = weekend)) +
scale_fill_manual(values = palette4, name="Time of Day") +
guides(color="none") +
geom_bar(stat = "identity", position="stack") +
scale_color_manual(values = c("transparent", "black")) +
labs(title="Top 50 Occurences of Rides By Time")+
ylab("Number of Rides") +
xlab("") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
ggplot()+
geom_sf(data = jcTracts %>%
st_transform(crs=4326), fill = "white")+
geom_point(data = to_plot,
aes(x=start_lng, y = start_lat, size=n), color = alpha("#6E0E0A", 0.7)) +
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
scale_size(range = c(0.5, 8)) +
labs(size = "Number of Rides") +
facet_grid(weekend ~ time_of_day) +
labs(title="Locations of Stations with Greatest Bikeshare Demand by Time")+
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
moran_data <- bike_census %>%
group_by(start_station_id, start_lat, start_lng, start_station_name) %>%
tally() %>%
arrange(-n) %>%
head(80) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326)
coords <- st_coordinates(moran_data)
neighborList <- knn2nb(knearneigh(coords, 5))
spatialWeights <- nb2listw(neighborList, style="W")
moranTest <- moran.mc(moran_data$n,
spatialWeights, nsim = 999)
ggplot(as.data.frame(moranTest$res[c(1:999)]), aes(moranTest$res[c(1:999)])) +
geom_histogram(binwidth = 0.01, fill = "#124E78") +
geom_vline(aes(xintercept = moranTest$statistic), colour = "#6E0E0A" ,lwd=1) +
scale_x_continuous(limits = c(-0.5, 0.5)) +
labs(title = "Observed and Permuted Moran's I for Stations with Most Rides",
subtitle = "Observed Moran's I in Red",
x = "Moran's I",
y = "Count") +
theme_light() +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10))
ggplot() +
geom_sf(data = jcTracts, fill = "white") +
geom_sf(data = moran_data,
aes(size = n, color = n)) +
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
scale_size(
range = c(1, 6),
guide = guide_legend(
direction = "horizontal",
nrow = 1,
label.position = "bottom")) +
scale_color_gradientn(colours = hcl.colors(5, "RdBu",rev = TRUE, alpha = 0.9),
guide = guide_legend(direction = "horizontal", nrow = 1)) +
labs(title = "Stations with Most Rides",
subtitle = "Dot size proportional to rides") +
theme(legend.position = "bottom") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
week33 <-
filter(bike_census , week == 33 & dotw == "Mon")
week33.panel <-
expand.grid(
interval15 = unique(week33$interval15),
Origin.Tract = unique(bike_census$Origin.Tract))
bike.animation.data <-
mutate(week33, Trip_Counter = 1) %>%
right_join(week33.panel) %>%
group_by(interval15, Origin.Tract) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
left_join(jcTracts, by=c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 3 ~ "1-3 trips",
Trip_Count > 3 & Trip_Count <= 6 ~ "4-6 trips",
Trip_Count > 6 & Trip_Count <= 10 ~ "7-10 trips",
Trip_Count > 10 ~ "11+ trips")) %>%
mutate(Trips = fct_relevel(Trips, "0 trips","1-3 trips","4-6 trips",
"7-10 trips","10+ trips"))
bikeshare_animation <-
ggplot() +
geom_sf(data = bike.animation.data, aes(fill = Trips)) +
scale_fill_manual(values = palette5) +
labs(title = "Bikeshare For One Day in Aug 2023",
subtitle = "15 minute intervals: {current_frame}") +
transition_manual(interval15) +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
#animate(bikeshare_animation, duration=40, renderer = gifski_renderer())
anim_save("animation.gif", animation = bikeshare_animation, end_pause = 4, height = 1800, width = 3000, fps = 6, renderer = gifski_renderer())
animate
weather.Panel <-
riem_measures(station = "EWR", date_start = "2023-08-01", date_end = "2023-09-01") %>%
dplyr::select(valid, tmpf, p01i, sknt)%>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(color="#6E0E0A") +
labs(title="Percipitation", x="Hour", y="Perecipitation") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank()),
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(color="#D74E09") +
labs(title="Wind Speed", x="Hour", y="Wind Speed") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank()),
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(color="#124E78") +
labs(title="Temperature", x="Hour", y="Temperature") +theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x=element_text(size=8),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank()),
top="Weather Data ")
Everything below needs re-run
study.panel <-
expand.grid(interval60=unique(bike_census$interval60),
start_station_id = unique(bike_census$start_station_id)) %>%
left_join(., bike_census %>%
select(start_station_id, start_station_name, Origin.Tract, start_lng, start_lat, rideable_type)%>%
distinct() %>%
group_by(start_station_id) %>%
slice(1))
bike.panel <-
bike_census %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start_station_id, start_station_name, Origin.Tract, start_lng, start_lat) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel) %>%
ungroup() %>%
filter(is.na(start_station_id) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE) %>%
left_join(., njCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
bike.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Temperature = first(Temperature)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Temperature, Trip_Count)) +
geom_point(color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Trip Count As a Fuction of Temperature by Week",
x="Temperature", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
bike.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Wind = first(Wind_Speed)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Wind, Trip_Count)) +
geom_point(color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Trip Count As a Fuction of Wind by Week",
x="Wind Speed", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
bike.panel <-
bike.panel %>%
arrange(start_station_id, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24)) %>%
mutate(day = yday(interval60))
as.data.frame(bike.panel) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
ggplot(aes(x = Value , y = Trip_Count)) +
geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~Variable, ncol = 6)+
labs(x = "Variable", y = "Correlation", title = "Correlation Between Serial Lag Trips and Trip Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
lag_result <- data.frame()
for (d in 213:243) {
for (h in 0:23) {
spacelag <- bike.panel %>%
mutate(hour = hour(interval60)) %>%
filter(day == d & hour == h) %>%
st_as_sf(., coords = c("start_lng", "start_lat"), crs = 4326)
coords.test_lag <- st_coordinates(spacelag)
neighborList.test_lag <- knn2nb(knearneigh(coords.test_lag, 5))
spatialWeights.test_lag <- nb2listw(neighborList.test_lag, style="W")
spacelag <- spacelag %>% mutate(lagTrip = lag.listw(spatialWeights.test_lag, Trip_Count))
lag_result <- rbind(lag_result, spacelag)
}
}
bike.panel <- lag_result %>%
mutate(start_lng = unlist(map(geometry, 1)),
start_lat = unlist(map(geometry, 2))) %>%
st_drop_geometry()
bike.panel %>%
group_by(interval60) %>%
summarize(Trip_Count = mean(Trip_Count),
Lag_Count = mean(lagTrip)) %>%
mutate(week = week(interval60)) %>%
ggplot(aes(Lag_Count, Trip_Count)) +
geom_point(size = 1, color = "#6E0E0A") + geom_smooth(method = "lm", se= FALSE, color="#124E78") +
facet_wrap(~week, ncol=8) +
labs(title="Correlation Between Spatial Lag Trips and Trip Count",
x="Lag Trip Count", y="Mean Trip Count") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
bike.Train <- filter(bike.panel, week <= 33)
bike.Test <- filter(bike.panel, week > 33)
# Time
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature + Wind_Speed, data=bike.Train)
# Space
reg2 <-
lm(Trip_Count ~ start_station_name + dotw + Temperature + Wind_Speed, data=bike.Train)
# Time and Space
reg3 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed,
data=bike.Train)
# Time and Space and TimeLag and SpaceLag
reg4 <-
lm(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed +
lagHour + lag2Hours + lag12Hours + lag1day + lagTrip,
data=bike.Train)
bike.Test.weekNest <-
bike.Test %>%
nest(-week)
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
bike.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_Lags = map(.x = data, fit = reg4, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette4) +
labs(title = "Mean Absolute Errors by Model Specification and Week") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id)) %>%
dplyr::select(interval60, start_station_id, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start_station_id) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
scale_color_manual(values=palette2)+
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed Bike Share Time Series", subtitle = "Jersey City: a test set of 2 weeks", x = "Hour", y= "Station Trips") +
theme(plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.text.y=element_text(size=8),
axis.title=element_text(size=10),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, size=0.8))
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
start_latitude = map(data, pull, start_lat),
start_longitude = map(data, pull, start_lng)) %>%
select(interval60, start_station_id, start_longitude, start_latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_Lags") %>%
group_by(start_station_id, start_longitude, start_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = jcTracts, fill = "white")+
geom_point(aes(x = start_longitude, y = start_latitude, color = MAE),
fill = "transparent")+
scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
labs(title="Mean Abs Error, Test Set, Model 4") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start_station_id = map(data, pull, start_station_id),
start_latitude = map(data, pull, start_lat),
start_longitude = map(data, pull, start_lng),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start_station_id, start_longitude,
start_latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "DTime_Space_FE_Lags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
group_by(start_station_id, weekend, time_of_day, start_longitude, start_latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = jcTracts, color = "grey", fill = "transparent")+
geom_point(aes(x = start_longitude, y = start_latitude, color = MAE),
fill = "transparent", size = 0.5)+
scale_color_continuous(low = "#124E78", high = "#6E0E0A", name= "MAE")+
ylim(min(bike_census$start_lat), max(bike_census$start_lat))+
xlim(min(bike_census$start_lng), max(bike_census$start_lng)) +
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors by Time, Test Set, Model 4") +
theme(axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks =element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
plot.subtitle = element_text(size = 9,face = "italic"),
plot.title = element_text(size = 12, face = "bold"),
panel.background = element_blank(),
panel.border = element_rect(colour = "grey", fill=NA, linewidth=0.8)
)
fitControl <- trainControl(method = "cv", number = 100)
reg.cv <- train(Trip_Count ~ start_station_name + hour(interval60) + dotw + Temperature + Wind_Speed + lagHour + lag2Hours + lag12Hours + lag1day + lagTrip, data=bike.panel, method = "lm", trControl = fitControl, na.action = na.pass)
reg.cv$resample %>%
summarise(MAE = mean(reg.cv$resample[,3]),
sd(reg.cv$resample[,3])
) %>%
kbl(col.name=c('Mean Absolute Error','Standard Deviation of MAE')) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| Mean Absolute Error | Standard Deviation of MAE |
|---|---|
| 0.3696923 | 0.0254276 |